Abstract

Advances in statistical learning theory have resulted in a multitude of different 
designs of learning machines. But which ones are implemented by brains and other 
biological information processors? We analyze how various abstract Bayesian learners 
perform on different data and argue that it is difficult to determine which learning- 
theoretic computation is performed by a particular organism using just its performance 
in learning a stationary target (learning curve). Based on the fluctuation-dissipation
relation in statistical physics, we then discuss a different experimental setup that might
be able to solve the problem.

1 Introduction 

Learning based on experience (variously known as sensing, information processing, or
adaptation) is ubiquitous on all scales in biology. For example, on the molecular scale,
the Lac operon in E. coli learns the lactose concentration to produce β-galactosidase (the
lactose-metabolizing enzyme) in proper quantities (Cohn and Horibata, 1959). Similarly,
in the sensory system, the retinal phototransduction cascade uses the information in the
arrivals of photons to learn the instantaneous light intensity and thus the current visual
scene (Detwiler et al., 2000). Additionally, it also learns the ambient light level (to adapt
to it) and the temporal correlations (to estimate motion) (Reichardt, 1961; de Ruyter van
Steveninck, personal communication). On the scale of cellular (neuronal) networks, learning,
memory, and adaptation in the neural code are textbook knowledge [see, e.g., (Brenner et al.,
2000; Fairhall et al., 2001)]. At yet larger scales, experiments on rodents are revealing how
they learn and respond to changes in their environments (Gallistel et al., 2001); this is a
simple, albeit quantifiable, example of the general phenomenon we call "learning" in everyday
life. Finally, we may also view evolution as an example of learning, where entire species
adapt to the world by means of natural selection.

The creativity of theorists matches that of Nature, and the number of learning paradigms
that differ in their goals, assumptions, methods, and performance guarantees is astonishing,
too large to enumerate here. Fortunately, it is possible to build uniform foundations for
many of these learning machines (Vapnik, 1998; Nemenman, 2000; Bialek et al., 2001), and to
find analogs among, say, Structural Risk Minimization (Vapnik, 1998) and Bayesian (Press,
1989) models. However, while one might argue that biological systems are (efficiently)
implementing one of many abstract learning-theoretic computations (Attneave, 1954; Barlow,
1959, 1961; Atick, 1992; Bialek et al., 2001), it is often unclear which exact computation
is performed in a particular case. For example, what is a learning-theoretic model equivalent
to a rat (Gallistel et al., 2001)? Or to a simple neural network that tries to maximize its
reward (Seung, 2003)? Answering such questions may explain some animal behaviors, uncover
which assumptions animals make about the surrounding world, and establish quantitative
limits on their learning performance.

To attack the problem, one can construct a biologically plausible computing machine
with a known learning-theoretic equivalent (Rao, 2004) and then search for a structural
similarity with a real living organism. We do not pursue this approach, but choose a
more traditional route to establish the equivalence: comparison of the performance of real
creatures to that of abstract learning machines. As we will argue, analysis of paradigmatic
learning curves is not always easy. Thus one of the most important results of the paper is
a suggestion of a new protocol for making such comparisons. The intuition behind the
suggestion comes from the famous Fluctuation-Dissipation Theorem (Ma, 1985). Based
on this analysis and on plausible assumptions about the statistics of natural stimuli, we also
suggest that a particular learning-theoretic model might be better suited for biological
learning than the alternatives, and thus it should be realized often in reality if optimization
of learning is desired.

To follow this route, we need to understand the characteristics of learning within different
mathematical models fairly well, and a large part of the paper is devoted to this. The analysis
is done in the framework of unsupervised Bayesian learning of probability distributions
since (a) the Bayesian paradigm is evidently relevant in neuroscience (Kording and Wolpert,
2004), (b) as mentioned, different learning frameworks are often equivalent, and (c) according
to Bialek et al. (2001), other learning problems usually can be reduced to unsupervised
learning of distributions. Much of this first part of the paper is an abridged review,
which follows the spirit and the notation of (Bialek et al., 2001) and often prefers clarity to
mathematical rigor. We do not try to make the review self-contained, but instead want to
elucidate and emphasize some important points that might have been of lesser interest
in other contexts and also to present some novel results, mostly developed in the Appendices.
After these developments, we return to the main question of this work: how can one
realistically determine an equivalent learning-theoretic model for a biological organism?



2 The basics of learning 



Learning machines should be powerful enough to explain complex phenomena. However,
when data is scarce, this power leads to overfitting and poor generalization. Thus a balance
must be struck between the abilities to explain and to overfit, and this balance will
depend on the amount of data available. Accordingly, much of statistical learning theory
(Jeffreys, 1936; Schwartz, 1978; Janes, 1979; Rissanen, 1989; Clarke and Barron, 1990; MacKay,
1992; Balasubramanian, 1997; Vapnik, 1998; Nemenman, 2000; Bialek et al., 2001) has been
devoted to putting the famous paradigm of William of Ockham, Pluralitas non est ponenda
sine necessitate, on firm mathematical footing in various theoretical frameworks. In particular,
in the Bayesian formulation (Press, 1989; Bernardo, 2003), we now know how proper
Bayesian averaging creates Occam factors that punish for complexity and weigh posterior
probabilities towards those estimates among a finite set of parametric model families that
have the best overall predictive power (Bialek et al., 2001), but do not necessarily produce
the best fit to the observed data. This has been called Bayesian model selection. 1

The waters get murkier in a nonparametric or infinite parameter setting when the whole
functional form of an unknown object is to be inferred. Bayesian nonparametric developments
generally parallel parametric ones, and techniques of Quantum Field Theory (QFT)
in computations (Bialek et al.; Nemenman and Bialek, 2002) [...] the two settings is unknown,
and some results suggest subtle logarithmic differences between the cases (Hall and Hannan,
1988; Rissanen et al., 1992; Bialek et al., 2001).



We now review these and other Bayesian learning machines, and we start with an 
introduction of some important and useful quantities. 

Suppose we observe i.i.d. samples $x_i$, $i = 1 \ldots N$. For simplicity, we assume that $x$ is a
scalar, but this does not affect most of the discussion. We need to estimate the probability
density that generates the samples. A priori we know that this density, $Q(x|\alpha)$, can be
indexed by some (possibly infinite dimensional) vector of parameters $\alpha$, and the probability
of each parameter value is $\mathcal{P}(\alpha)$. Then we define the density of models (solutions), at a given
distance (dissimilarity, or divergence) $D(\bar\alpha,\alpha) = \epsilon$ away from the unknown true target $\bar\alpha$,
which is being learned:

$$\rho(\epsilon;\bar\alpha) = \int d\alpha\,\mathcal{P}(\alpha)\,\delta[D(\bar\alpha,\alpha) - \epsilon]. \tag{1}$$



For Bayesian inference of probability densities, the correct measure of dissimilarity is the
Kullback-Leibler divergence, $D_{\rm KL}(\bar\alpha\|\alpha) = \int dx\,Q(x|\bar\alpha)\log[Q(x|\bar\alpha)/Q(x|\alpha)]$ (Bialek et al.,
2001), which has an important information-theoretic interpretation (Cover and Thomas,
1991). However, in other situations different choices of $D$ can and should be made.
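As an illustration of the Kullback-Leibler divergence used above, the following minimal sketch evaluates it by quadrature for a pair of one-dimensional Gaussian densities and compares the result with the standard closed form. The Gaussian family, the function names, and the integration range are assumptions chosen only for this example; they are not part of the original analysis.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def kl_numeric(mean_t, std_t, mean_m, std_m, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule estimate of D_KL(target || model) in nats."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_t = gaussian_logpdf(x, mean_t, std_t)
        log_m = gaussian_logpdf(x, mean_m, std_m)
        # Integrand Q_target * log(Q_target / Q_model), computed in log space.
        total += math.exp(log_t) * (log_t - log_m) * dx
    return total

def kl_closed_form(mean_t, std_t, mean_m, std_m):
    """Closed-form KL divergence between two 1D Gaussians, in nats."""
    return (math.log(std_m / std_t)
            + (std_t ** 2 + (mean_t - mean_m) ** 2) / (2.0 * std_m ** 2)
            - 0.5)
```

Swapping the roles of target and model changes the value, which is why the order of arguments in the divergence matters in the text.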

Performance of a Bayesian learner is usually measured by the speed with which the
posterior probability concentrates for $N\to\infty$ (the learning curve) and by whether the point
of concentration is the true unknown target (consistency). These characteristics illuminate
the importance of $\rho$, as it relates to both of them. First, it has been proven that if, for $\epsilon\to+0$,
the density $\rho(\epsilon;\bar\alpha)$ is not zero, then the Bayesian problem is consistent (Nemenman, 2000;
Bialek et al., 2001). Intuitively, this is because, for large density, statistical fluctuations
of the sample and of the estimated parameters result in small $D_{\rm KL}(\bar\alpha\|{\rm estimate})$, making
convergence to the target almost certain.

The relation of $\rho$ to the learning curve is more complicated. We can calculate the average
(over samples) Occam factor for a given target (the generalization error, or the fluctuation
determinant) to the leading order in $1/N$:

$$\mathcal{D}(\bar\alpha;N) = -\log\int d\epsilon\,\rho(\epsilon;\bar\alpha)\,e^{-N\epsilon}. \tag{2}$$



This is the term that emerges as the penalty for complexity in Bayesian model selection
(Balasubramanian, 1997; Bialek et al., 2001). If averaged over $\bar\alpha$, the Occam factor becomes



1 With the creationism-evolution tension mounting in teaching of biology in the U. S. schools, it is amusing 
to see how two friars, William of Ockham and Thomas Bayes, teamed up with modern day mathematicians to 
produce, in my view, the clearest formulation of the theory of learning from past experiences. If this approach 
results in a better understanding of biological designs, the situation will be even more peculiar. 
predictive information (Bialek et al., 2001), which is the average number of bits that $N$
samples provide about the unknown parameters,

$$I_{\rm pred}(N) = \int d\bar\alpha\,\mathcal{P}(\bar\alpha)\,\mathcal{D}(\bar\alpha;N). \tag{3}$$

Finally, one can define the universal learning curve, which measures the expected $D_{\rm KL}$
between the target and the estimate after $N$ observations (Bialek et al., 2001). Up to the first
order in the large parameter $N$, this is

$$\Lambda(\bar\alpha;N) = \frac{d\mathcal{D}(\bar\alpha;N)}{dN}, \tag{4}$$

$$\Lambda(N) = \int d\bar\alpha\,\mathcal{P}(\bar\alpha)\,\Lambda(\bar\alpha;N) = \frac{dI_{\rm pred}}{dN}. \tag{5}$$



Many of these quantities, especially $I_{\rm pred}$, are also natural objects when analyzing the
complexity of a time series (Bialek et al., 2001).



3 Different models of learning 

Since one of the goals of this work is to investigate whether learning machines can be
discriminated by means of their learning performance, specifically $\Lambda(N)$, here we discuss
how $\Lambda$ depends on $N$ for different scenarios.

3.1 Learning in a finite set of parameters 

Consider a setup where $\alpha$ can take $M$ discrete values $\alpha_1, \alpha_2, \ldots, \alpha_M$ with a priori
probabilities $\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_M$, and their divergences from the target $\alpha_1$ are
$0 = d_1 < d_2 < \cdots < d_M$. The density is $\rho(\epsilon;\alpha_1) = \sum_{i=1}^{M}\mathcal{P}_i\,\delta(d_i - \epsilon)$. For $N\to\infty$, we have

$$\mathcal{D}(\alpha_1;N) = -\log\sum_{i=1}^{M}\mathcal{P}_i\exp[-Nd_i] \approx -\log\mathcal{P}_1 - \frac{\mathcal{P}_2}{\mathcal{P}_1}\exp[-Nd_2], \tag{6}$$

$$\Lambda(\alpha_1;N) \approx d_2\,\frac{\mathcal{P}_2}{\mathcal{P}_1}\exp[-Nd_2]. \tag{7}$$

So exponential learning curves (and asymptotically finite $\mathcal{D}$ and $I_{\rm pred}$) correspond to
learning a possibility in a finite set. Similarly, we can construct models with $\Lambda(N) \propto 1/N^\nu$,
$\nu > 1$, and they will also have asymptotically finite $I_{\rm pred}$.
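The exponential learning curve for a finite set of alternatives can be checked numerically. The sketch below is only an illustration: the function names and the particular priors and divergences used in any call are hypothetical. It evaluates the Occam factor of Eq. (6) exactly, obtains the learning curve as its derivative with respect to N, and compares it with the leading asymptotic term of Eq. (7).

```python
import math

def occam_factor(priors, divergences, n):
    """D(alpha_1; N) = -log sum_i P_i exp(-N d_i), with d_1 = 0 for the target."""
    return -math.log(sum(p * math.exp(-n * d) for p, d in zip(priors, divergences)))

def learning_curve(priors, divergences, n):
    """Exact dD/dN: the mean divergence weighted by P_i exp(-N d_i)."""
    weights = [p * math.exp(-n * d) for p, d in zip(priors, divergences)]
    return sum(w * d for w, d in zip(weights, divergences)) / sum(weights)

def learning_curve_asymptotic(priors, divergences, n):
    """Leading large-N term, d_2 (P_2 / P_1) exp(-N d_2)."""
    return divergences[1] * priors[1] / priors[0] * math.exp(-n * divergences[1])
```

For example, with priors (0.5, 0.3, 0.2) and divergences (0, 0.1, 0.5), the exact and asymptotic curves agree closely once N times the second divergence is large, and the Occam factor saturates at minus the logarithm of the prior of the target.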

3.2 Finite parameter learning 

Now let the target probability density $Q(x|\bar\alpha)$, or a model, belong to a set of densities $A$, a
model family, that can be indexed by a vector of parameters $\alpha \in A$, $\dim\alpha = K < \infty$, and
$\forall\alpha \in A$, $\mathcal{P}(\alpha) > 0$. Then if $A$ is not compact, or if the KL divergence between $\bar\alpha$ and the
boundary of $A$ is larger than $\epsilon$, the density of solutions for such a $K$-parametric family
is (Bialek et al., 2001)

$$\rho(\epsilon;\bar\alpha) \approx \mathcal{P}(\bar\alpha)\,\frac{2\pi^{K/2}}{\Gamma(K/2)}\,\frac{\epsilon^{(K-2)/2}}{\sqrt{\det\mathcal{F}}}, \tag{8}$$

where

$$\mathcal{F}_{\mu\nu} = \left.\frac{\partial^2 D_{\rm KL}(\bar\alpha\|\alpha)}{\partial\alpha_\mu\,\partial\alpha_\nu}\right|_{\alpha=\bar\alpha}. \tag{9}$$

$N\mathcal{F}_{\mu\nu}$ is the Fisher information matrix (Cover and Thomas, 1991); its eigenvectors are the
principal axes of the error ellipsoid in the parameter space, and the (inverse) eigenvalues
are variances of parameter estimates along each of these directions. The prefactor
$2\pi^{K/2}/\Gamma(K/2)$ is the area of the $K$-sphere, and it has to be multiplied by the fraction of the
sphere that is inside $A$ if the latter is (semi)compact. Eq. (8) now gives

$$I_{\rm pred}(N) \approx \mathcal{D}(\bar\alpha,N) \approx \frac{K}{2}\log N, \tag{10}$$

$$\Lambda(N) \approx \Lambda(\bar\alpha,N) \approx \frac{K}{2N}. \tag{11}$$
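The step from the power-law density of Eq. (8) to the K/(2N) learning curve of Eq. (11) can be verified numerically. The following sketch is an illustration under stated assumptions: it keeps only the factor of the density that depends on the distance (all constant prefactors are dropped, which shifts the Occam factor by a constant but does not change its derivative), and the function names, grid limits, and step counts are hypothetical choices.

```python
import math

def occam_factor(k, n, steps=20000, u_lo=-30.0, u_hi=5.0):
    """-log of the integral of eps**(k/2 - 1) * exp(-n * eps) over eps > 0.

    The integral uses the midpoint rule on a uniform grid in u = log(eps).
    """
    du = (u_hi - u_lo) / steps
    total = 0.0
    for i in range(steps):
        u = u_lo + (i + 0.5) * du
        # d(eps) = eps * du, hence the exponent k/2 instead of k/2 - 1.
        total += math.exp(0.5 * k * u - n * math.exp(u)) * du
    return -math.log(total)

def learning_curve(k, n):
    """Central-difference estimate of dD/dN with unit step in N."""
    return 0.5 * (occam_factor(k, n + 1) - occam_factor(k, n - 1))
```

With these definitions, `learning_curve(4, 1000)` should be close to 4 / (2 * 1000), and `occam_factor(4, 1000)` should match (K/2) log N up to the N-independent constant log Γ(K/2).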



The situation changes slightly if $Q \notin A$ (here $Q$ is the target density), and the prior
assumptions about the world are wrong. Then we find the best approximation to the
target within $A$, $\tilde\alpha = \arg\min_{\alpha\in A} D_{\rm KL}(Q\|\alpha)$, and define the distance between $Q$ and $A$,
$D_A(Q) = D_{\rm KL}(Q\|\tilde\alpha)$ [this is similar to the I-projection (Csiszar, 1975), but the order of
arguments in $D_{\rm KL}$ is different]. In this case, the model density is zero for $\epsilon < D_A(Q)$, and
the estimate concentrates near $\tilde\alpha$ as $N\to\infty$. Thus, if the radius of curvature of $A$ is much
larger than $\epsilon$, and $D_A(Q)$ is also small, then Eqs. (8, 11) generalize to

$$\rho(\epsilon;\bar\alpha) = 0, \quad \epsilon < D_A, \tag{12}$$

$$\Lambda(Q,N) \approx D_A(Q) + \frac{K}{2N}. \tag{13}$$


3.3 Nested finite parameter models 

Suppose now the target $Q(x)$ that generates the observations belongs to one of $R$ model
families, $A_r$, $r = 1 \ldots R$, with ${\rm Prob}(Q \in A_r) = \mathcal{P}(r)$. Models in each of the families are
indexed by parameters $\alpha^{(r)}$, $\dim\alpha^{(r)} = K(r) < \infty$, so that the density of observing $x$ in
a given model is $Q_r(x|\alpha^{(r)})$. Within each family, the parameters are a priori distributed
according to $\mathcal{P}(\alpha^{(r)}|r)$.

We will assume that the families are nested. By this we mean that $Q_r(x|\alpha^{(r)}) = Q(x|\alpha)$
are independent of $r$, but that in each family the values of $\alpha_\mu$, $\mu > K(r)$, are identically
zero. Further, the nonzero parameters have the same a priori distributions in all families:

$$\mathcal{P}(\alpha_\mu|r) = \begin{cases} p(\alpha_\mu), & \mu \le K(r), \\ \delta(\alpha_\mu), & \mu > K(r), \end{cases} \tag{14}$$

$$\mathcal{P}(\alpha|r) = \prod_{\mu}\mathcal{P}(\alpha_\mu|r). \tag{15}$$



2 A different scaling dimension dx may appear in these formulas instead of K, the number of parameters.
For example, for a redundant parameterization, dx < K. Opposite situations, dx > K and even dx → ∞,
are also possible (Bialek et al., 2001).
Thus a parameter is "switched on" (or "activated") when $r$ reaches $r_\mu = \min_r\{r : K(r) \ge \mu\}$.
Discussion of such nested models has been current in Bayesian (Bernardo, 2003; Raftery and
Zheng, 2003) and frequentist (Neter et al., 1996) literature for many years.
However, we are unaware of any comprehensive analysis relevant to important questions
analyzed in our current presentation, such as those in the Appendices.

If $R\to\infty$, then we require that the union of all families forms a complete set, so that
every sufficiently smooth probability density can be approximated arbitrarily closely by
some member of the union (if needed, this definition can be made more precise).

For simplicity, in this paper we focus on 3

$$p(\alpha_\mu) = \mathcal{N}(0,\sigma_\mu^2), \tag{16}$$

$$\sigma_\mu^2 = c\,\mu^{-\beta}, \quad \beta \ge 0, \quad c = {\rm const}, \tag{17}$$

where $\mathcal{N}(a,b)$ denotes a normal distribution with the mean of $a$ and the variance of $b$.
In particular, $\beta = 0$ corresponds to the same in-family a priori variances for all active
parameters. This is common when discussing Bayesian model selection.

While these priors describe a set of parametric models, another view is also possible.
The joint distribution of $r$, $\alpha$, and $\{x\}$ is $\mathcal{P}(\{x\},\alpha,r) = Q(\{x\}|\alpha)\,\mathcal{P}(\alpha|r)\,\mathcal{P}(r)$, which results
in

$$\mathcal{P}(\{x\},\alpha) = \sum_r Q(\{x\}|\alpha)\,\mathcal{P}(\alpha|r)\,\mathcal{P}(r) = Q(\{x\}|\alpha)\sum_r\mathcal{P}(\alpha|r)\,\mathcal{P}(r) = Q(\{x\}|\alpha)\,\mathcal{P}(\alpha), \tag{18}$$

where the last equation defines $\mathcal{P}(\alpha)$, the overall prior over $\alpha$. Unlike $\mathcal{P}(\alpha|r)$, $\mathcal{P}(\alpha)$ is not
factorizable and is not differentiable at zero for any $\alpha_\mu$, $\mu > K(1)$. Thus the nested setup
may be viewed as inference in a combined model family with $K(R)$ parameters. In particular,
for $R$ and $K(R)\to\infty$, the learning problem has a countable infinity of parameters,
leading to the common assumption of equivalence with nonparametric inference.

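It is of interest to calculate the combined a priori mean and variance of $\alpha$. Integrating
over all $\alpha_\nu$, $\nu \ne \mu$, we get the combined prior for $\alpha_\mu$

$$\mathcal{P}(\alpha_\mu) = \delta(\alpha_\mu)\sum_{r<r_\mu}\mathcal{P}(r) + p(\alpha_\mu)\sum_{r\ge r_\mu}\mathcal{P}(r). \tag{19}$$

By Eq. (16), the a priori means of all parameters are zero, and the variances are

$$\langle\delta\alpha_\mu^2\rangle = \sigma_\mu^2\sum_{r\ge r_\mu}\mathcal{P}(r). \tag{20}$$

Thus the bare variance is "renormalized" by the probability to be in a family in which
the parameter is nonzero. An interesting special case is

$$\mathcal{P}(r) \propto r^{-\gamma}, \quad \gamma > 1, \quad R\to\infty, \tag{21}$$

$$r_\mu = \mu. \tag{22}$$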

3 Nestedness, completeness, and normality of the priors are needed only for comparison with models dis- 
cussed later, and they are not essential for Bayesian learning. 
Then the a priori variance gets a simple form

$$\langle\delta\alpha_\mu^2\rangle \propto \frac{1}{\mu^\beta}\sum_{r=\mu}^{\infty}r^{-\gamma} \sim \mu^{-\beta-\gamma+1}. \tag{23}$$

Thus $\langle\delta\alpha_\mu^2\rangle$ depends as much on the bare variance as on the speed of decay of $\mathcal{P}(r)$. This
suggests that the learning properties of the nested setup will depend equivalently on $\beta$ and
$\gamma$. In fact, as shown in the Appendices, this is not true: while the behavior of $p(\alpha_\mu)$ is important,
any reasonable choice of $\mathcal{P}(r)$ does not affect the success of the learning.
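The power-law scaling of the renormalized a priori variance with the mode index can be checked directly. This is a minimal sketch under assumptions made only for the illustration: the family prior is taken as an unnormalized power law in r with exponent gamma, each family activates exactly one new parameter (so that the activation index of mode mu is mu itself), and the function names and truncation point are hypothetical.

```python
import math

def tail_sum(mu, gamma, r_max=200000):
    """Approximate sum over r >= mu of r**(-gamma), for gamma > 1."""
    head = sum(r ** (-gamma) for r in range(mu, r_max + 1))
    # Integral estimate of the neglected terms with r > r_max.
    tail = (r_max + 0.5) ** (1.0 - gamma) / (gamma - 1.0)
    return head + tail

def prior_variance(mu, beta, gamma, c=1.0):
    """Bare variance c * mu**(-beta) times the (unnormalized) tail of P(r)."""
    return c * mu ** (-beta) * tail_sum(mu, gamma)

def loglog_slope(beta, gamma, mu_lo=100, mu_hi=200):
    """Slope of log(variance) versus log(mu) between two mode indices."""
    v_lo = prior_variance(mu_lo, beta, gamma)
    v_hi = prior_variance(mu_hi, beta, gamma)
    return math.log(v_hi / v_lo) / math.log(mu_hi / mu_lo)
```

For large mode indices the measured slope should approach -(beta + gamma - 1); for instance, `loglog_slope(1.0, 2.0)` should be close to -2.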

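From Eq. (8) we can now evaluate the model density and the learning curve for the
nested setup. For each value of $\bar\alpha$ and $r$, we can find the model $\tilde\alpha_r = \arg\min_{\alpha\in A_r} D_{\rm KL}(\bar\alpha\|\alpha)$
that best approximates $\bar\alpha$ in $A_r$, and define $D_r(\bar\alpha) = D_{\rm KL}(\bar\alpha\|\tilde\alpha_r)$, the distance between $\bar\alpha$
and $A_r$. If $\bar\alpha \in A_r$, then $\tilde\alpha_r = \bar\alpha$, and $D_r(\bar\alpha) = 0$. However, if $\bar\alpha \notin A_r$, then $D_r(\bar\alpha) > 0$. We
then have:

$$\rho(\epsilon;\bar\alpha) = \sum_{r:\,D_r\le\epsilon}\mathcal{P}(r)\,\mathcal{P}(\tilde\alpha_r|r)\,\frac{2\pi^{K(r)/2}}{\Gamma[K(r)/2]}\,\frac{(\epsilon - D_r)^{[K(r)-2]/2}}{\sqrt{\det\mathcal{F}^{(r)}}}. \tag{24}$$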

The learning curve in this scenario strongly depends on the target. Let $\bar\alpha$ have $\bar r$ active
modes (with $\bar r$ determined according to $\mathcal{P}(r)$), and let each of these modes have an amplitude
$\sim\sigma$ (that is, $\beta = 0$). Then $D_r(\bar\alpha)$ is either exactly zero (for $r \ge \bar r$), or large (for $r < \bar r$).
So Eq. (24) becomes

$$\rho(\epsilon\to0;\alpha_{{\rm typ};\bar r}) \approx \sum_{r\ge\bar r}\mathcal{P}(r)\,\mathcal{P}(\tilde\alpha_r|r)\,\frac{2\pi^{K(r)/2}}{\Gamma[K(r)/2]}\,\frac{\epsilon^{[K(r)-2]/2}}{\sqrt{\det\mathcal{F}^{(r)}}}, \tag{25}$$

where $\alpha_{{\rm typ};\bar r}$ is a distribution typical in $A_{\bar r}$. This is dominated by $r = \bar r$, and for $N \gg K(\bar r)$
the learning curve is

$$\Lambda(N) \approx \frac{K(\bar r)}{2N}. \tag{26}$$

It is now clear that averaging over $\mathcal{P}(\bar r)$ is not very informative. Note also that, for $N < \bar r$,
the learning curve goes through a cascade of $K(r)/N$ behaviors, $1 < r < \bar r$, and changes
of the prefactor of the $N^{-1}$ scaling correspond to activations of new parameters, which
happen rather abruptly (cf. Appendix A).



3.4 Nonparametric learning 



Nonparametric learning usually refers to inferring a functional form of a probability
density $Q(x)$, or rather of $\phi(x) = -\log Q(x)$, with some smoothness constraints on it. The
constraints may be in the form of bounding some derivatives of $Q$ or $\phi$, which was the
choice of Hall and Hannan (1988) and Rissanen et al. (1992). 4 Alternatively, in the Bayesian



4 These authors used histogramming density estimators, which have no hierarchy of model families; this is
especially true for Rissanen et al. (1992), who allowed locally varying bin widths. Therefore, these techniques
cannot be referred to as nested parametric methods. On the other hand, they allow an arbitrarily precise fit
to any probability density and may require an arbitrarily large number of break points and density values for
complete specification. This is the reason for treating them as nonparametric.
framework followed here, the constraints may be incorporated into a functional prior that
makes sense as a continuous theory, independent of discretization of $x$ on small scales. For
$x$ in one dimension, the minimal and the most common choice is (Bialek et al., 1996; Aida,
1999; Nemenman and Bialek, 2002; Lemm, 2002)

$$\mathcal{P}[\phi(x)] = \frac{1}{Z}\exp\left[-\frac{\ell^{2\eta-1}}{2}\int dx\left(\frac{\partial^\eta\phi}{\partial x^\eta}\right)^2\right]\delta\left[\int dx\,e^{-\phi(x)} - 1\right], \tag{27}$$



where $\eta > 1/2$, $Z$ is the normalization constant, and the $\delta$-function enforces normalization
of $Q$. The hyperparameters $\ell$ and $\eta$ are called the smoothness scale and the smoothness
exponent, respectively. Fractional order derivatives are defined by multiplying by the wave
number to the appropriate power in the Fourier representation of $\phi$ (we assume periodicity
on $[0,1)$).

This prior is equivalent to specifying a 1-dimensional Quantum Field Theory (Bialek
et al., 1996; Holy, 1997; Nemenman and Bialek, 2002; Lemm, 2002), and QFT methods
have been successful in the analysis. In particular, the maximum likelihood estimate of the
distribution, $Q^*(x) = \exp[-\phi^*(x)]$, is given by the following differential equation

$$\ell^{2\eta-1}\,\mathbb{R}_{-2\eta}\,\frac{\partial^{2\eta}\phi^*(x)}{\partial x^{2\eta}} - NQ^*(x) + \sum_{i=1}^{N}\delta(x - x_i) = 0, \tag{28}$$



where the operator $\mathbb{R}_\theta$ shifts the phase of each Fourier component of its argument by $\pi\theta/2$. 5
The equation shows that derivatives of $\phi^*$ and $Q^*$ of order $2\eta - 1$ have step discontinuities.
Thus, for $2\eta = 1$ the classical solution itself, $\phi^*(x)$, is discontinuous, and for $2\eta < 1$ the
singularities are even more severe. We may characterize sample-dependent fluctuations in
$Q^*$ by $D_{\rm KL} = \int dx\,Q_1^*(x)\log[Q_1^*(x)/Q_2^*(x)]$, where $Q_1^*$ and $Q_2^*$ are saddle point solutions for
different sample realizations. If $Q^*$ has, at least, step discontinuities at the sample points,
and these points are random, then $D_{\rm KL}$ does not fall to zero as $N$ grows. Therefore, the
QFT setup becomes inconsistent at $\eta = 1/2$, even though the Bayesian formulation is proper,
and the prior can still be normalized by, for example, going to the Fourier representation.
This is in contrast to the nested setup, where normalizable priors guarantee consistency.

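Bialek et al. (1996, 2001) have calculated the $\epsilon\to0$ model density and the fluctuation
determinant for different $\eta$'s. By noticing from Eq. (28) that $N$ and $\ell$ can enter the solutions
only in the combination $N/\ell^{2\eta-1}$, we extend their results and recover the correct dependence
not only on $\eta$ (for $\eta > 1/2$), but also on $\ell$:

$$\rho(\epsilon;\bar\phi) \approx A[\bar\phi]\,\epsilon^{-\xi}\exp\left[-\frac{B[\bar\phi]}{\ell\,\epsilon^{1/(2\eta-1)}}\right], \tag{29}$$

$$\mathcal{D}(\bar\phi;N) \approx C[\bar\phi]\left(\frac{N}{\ell^{2\eta-1}}\right)^{1/2\eta}, \tag{30}$$

$$\Lambda(\bar\phi;N) \approx \frac{C[\bar\phi]}{2\eta\,\ell^{(2\eta-1)/2\eta}}\,N^{1/2\eta-1}. \tag{31}$$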



5 For a comprehensive treatment of fractional differentiation the reader is referred to Samko et al. (1987). In
particular, the action of the phase shift operator $\mathbb{R}_\theta$ may be calculated by the Wiener-Hopf method.
Here $\xi$ depends only on $\eta$, and $A$, $B$, and $C$ are some known related functionals that do
not depend on $\ell$. These asymptotics kick in when $N \gg 1/\ell$. In particular, for smaller $N$,
$\Lambda \sim 1$ and is barely decreasing. The dependence of $C[\bar\phi]$ on $\bar\phi$ may be significant (and,
possibly, diverging for ill-behaved targets). However, from Eq. (29), the dependence on $\eta$
near $2\eta - 1 \to +0$ is easier to analyze:

$$C[\bar\phi] \sim (2\eta - 1)^{1/2\eta-1}, \tag{32}$$



with an undetermined value at $2\eta = 1$. So for $\eta\to1/2$, $\mathcal{D}$ approaches extensivity in $N$,
and then becomes ill-defined, again signaling inconsistency. As discussed by Bialek et al.
(2001), problems with $\mathcal{D}(N)/N \to {\rm const}$ are the most complicated correctly posed learning
problems that exist, and they can be studied in the Bayesian QFT setting.

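For comparison with the nested case (cf. Appendix B), we may replace $\phi$ by its Fourier
series,

$$\phi(x|\alpha) = -\log Q(x|\alpha) = \alpha_0 + \sum_{\mu=1}^{r}\left(\alpha_\mu^+\cos 2\pi\mu x + \alpha_\mu^-\sin 2\pi\mu x\right), \tag{33}$$

$$\alpha_0 = \log\int dx\,\exp\left[-\sum_{\mu=1}^{r}\left(\alpha_\mu^+\cos 2\pi\mu x + \alpha_\mu^-\sin 2\pi\mu x\right)\right], \tag{34}$$

with $r\to\infty$ (finite $r$ results in a finite parameter model). The last equation enforces
normalization, $\int dx\,Q(x|\alpha) = 1$, and it is equivalent to the constraint $\delta\left(\int\exp[-\phi(x|\alpha)]\,dx - 1\right)$
in the prior $\mathcal{P}(\alpha|r)$ or $\mathcal{P}[\phi(x)]$. Since the Jacobian of the transformation $\phi(x)\to\{\alpha_\mu^\pm\}$
is a constant, Eq. (27) amounts to zero-mean Gaussian priors over $\alpha_\mu^\pm$ with the variance
(Nemenman and Bialek, 2002)

$$\langle(\alpha_\mu^\pm)^2\rangle = \frac{2}{\ell^{2\eta-1}(2\pi\mu)^{2\eta}}, \quad \mu > 0. \tag{35}$$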


Equations (23, 35) suggest that the nested and nonparametric cases are similar: the
a priori means of the amplitudes are zero, and the variances fall off as power laws in $\mu$.
However, the field theory model requires the variance to decrease at least as fast as $1/\mu$
(recall that $\eta > 1/2$), while the finite parameter case does not impose such constraints. This
is an indication of an essential difference between the models: in the nested case, the priors,
specifically the a priori variances of parameters, have less of an influence on learning.
This can be easily explained. QFT nonparametric models do not have a sharp separation
between active and passive modes. The modes with low $\mu$ are determined by the data,
but fluctuations for larger $\mu$ are inhibited only due to the small a priori variances, Eq. (35).
The exact attenuation of the fluctuations depends on the values of $\eta$ and $\ell$, and the cumulative
contribution to the posterior variance of the estimator may be substantial. In contrast, for
the finite parameter nested case, once the most probable model family is determined,
fluctuations of the higher order parameters are inhibited exponentially (cf. Appendix A). The
cumulative fluctuations are then small and almost independent of the a priori parameter
variances, and the learning may succeed even for $\mathcal{P}(r)$ with a long tail.

The dependence of the QFT model on the prior can be weakened by treating $\ell$ as an
unknown random variable and averaging over it (Bialek et al., 1996; Nemenman and Bialek,
2002) (similar averaging over $\eta$ has not yet been performed). This is akin to nesting of
finite dimensional models and improves learning curves for a wide range of targets. On
the other hand, integration over $\ell$ produces a theory that is not necessarily local in $\phi(x)$,
couples all of the Fourier amplitudes, and is difficult to compare to the nested finite
parameter setup directly. Therefore, we do not discuss the averaging in what follows, but
assume that the values of $\eta$ and $\ell$ used for learning are the best for a particular target being
learned.



3.5 Comparing the performance 

One of the goals of the paper is to decide if learning curves can be used to distinguish
which learning machine is a good description of a particular biological system. To this
end, we need to analyze responses of various learners to data that they are not expecting.
Thus in this section we derive learning curves for a finite parameter, a nested ($\beta = 0$), and a
QFT machine on data that is typical in the prior of one of the two others. With mismatched
data and expectations, the learning curve cannot be optimal, but may come quite close.

First, consider $Q$ taken from the QFT prior. The learning curve for the finite parameter
model is given by Eq. (13): an $N^{-1}$ decay towards some approximate target. Further, as
shown in the Appendices, the learning curves for complete nested models, Eq. (66), and for
the QFT machine, Eq. (31), which is the best possible machine for such data, differ only
logarithmically.

If instead we study a distribution that is typical in the nested case for some $\bar r$ (equivalently,
a finite parameter distribution with $K(\bar r)$ parameters), then a finite parameter model
again gives Eq. (13). On the other hand, for $N < K(\bar r)$, no complete learning machine can
estimate all required unknown parameters, and $\Lambda$ does not have a well defined scaling
(Nemenman and Bialek, 2002). The differences between the machines emerge for $N \gg \bar r$.

The nested machine eventually asymptotes to Eq. (26), and starts learning at the rate of
$1/N$. However, the QFT setup performs differently: when $\Lambda\to0$ and all $\bar r$ modes are well
approximated, the machine continues trying to fit higher order modes, which it expects to
be present even though they are not. This will result in the same fluctuation determinant
as in Eq. (30), switching to the usual asymptotic $\Lambda \propto N^{1/2\eta-1}$ instead of $1/N$.

So, surprisingly, when the target has a finite number of degrees of freedom, the nested
setup is qualitatively faster than the QFT learning machine!
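To make the difference in rates concrete, the sketch below estimates how many samples each machine needs before its asymptotic learning curve falls to a given error. It is only an illustration under stated assumptions: the nested curve is taken as K/(2N), the QFT curve as a prefactor times N to the power 1/(2η) - 1, and the function names and the default prefactor of one are hypothetical stand-ins for the target-dependent constants.

```python
def samples_nested(eps, k):
    """N at which the nested learning curve K / (2N) falls to eps."""
    return k / (2.0 * eps)

def samples_qft(eps, eta, prefactor=1.0):
    """N at which prefactor * N**(1/(2*eta) - 1) falls to eps (eta > 1/2)."""
    exponent = 1.0 - 1.0 / (2.0 * eta)  # positive for eta > 1/2
    return (prefactor / eps) ** (1.0 / exponent)
```

For instance, with five active parameters and a target error of 0.01, `samples_nested(0.01, 5)` gives 250 samples, whereas `samples_qft(0.01, 1.0)` gives about 10,000, and the gap widens as the smoothness exponent approaches 1/2.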



4 Learning a changing target 

One never needs to know the distribution that generated the data to an infinite precision,
and some $\epsilon > 0$ approximation is usually enough. Further, if learning in biological systems
is stochastic, as argued, for example, by Seung (2003), then $\epsilon$ is bounded from below by
the noise variance. As shown by Fairhall et al. (2001) and especially by Gallistel et al.
(2001), convergence to the "good enough" estimate happens so quickly that the transient
learning curves are difficult to resolve. Is then the performance difference between the
nested and the QFT scenarios seen in the previous section important? And can it be used
to discriminate between the models?
4.1 Model density and variable stimuli

Notice that often the target itself changes while being learned. The ambient light intensity
may be fluctuating while our eye estimates it, or the variance of angular velocities
measured by a fly motion sensitive neuron can be varied by an experimenter while the fly tries
to adapt to it (Fairhall et al., 2001). In these cases one has to learn constantly to stay at the
allowed $\epsilon$-error, and then a faster learning machine may be truly advantageous. However,
even for a variable target, the nested learner will not be helpful if (a) a small change of
the target parameters throws it back to a very large $\Lambda$, or (b) the changing target may drift
to a region where $\bar r$ is so large that the nested setup is not better than the nonparametric
anymore.

To answer these concerns, instead of focusing on the density of solutions as a function
of the allowed error $\epsilon$, we will keep $\epsilon$ fixed and vary $\bar\alpha$. For some small $\epsilon$, a schematic
drawing of the dependence of $\rho$ on two of the target parameters, one of them $\bar\alpha_1$, with the
other parameters fixed, is shown in Fig. 1. In the nested case, there is a ridge along
$\bar\alpha_1 \approx 0$, where the density is, at least, $\sim 1/\sqrt{\epsilon}$ larger than anywhere else, cf. Eq. (24).
The ridge comes from the prior, Eq. (18), for $\bar\alpha_1 = 0$ being singularly larger than for
$\bar\alpha_1 \ne 0$, and the singularity is then smoothed out by the $\epsilon$-approximation. In comparison,
the nonparametric prior has a bivariate normal shape, which after $\epsilon$-smearing results in a
weak target dependency of the $\epsilon$-independent prefactors in Eq. (29); thus $\rho(\bar\alpha)$ varies slowly. 6

Figure 1 answers both of the concerns mentioned above. For a QFT machine, densities
everywhere are comparatively small. So a small change of the target means vast and slow 
relearning. In contrast, if, for a nested case, a is in the large density region, then there are 
many other models in the vicinity. Small parameter changes likely leave the target close, 
and not much needs to be relearned. Further, since the ridge drops off smoothly, models 
in the vicinity of a large density target also have large densities, and thus are learned fast 
as well. Of course, this holds only when the target, indeed, varies mostly along a small set 
of directions, and density ridges are aligned with those. Importantly, since at a finite e the 
ridge has a finite width, a perfect alignment is not necessary. 

We believe that many natural signals have such structure. For example, in phototrans- 
duction, instantaneous intensity is determined by the statistics of reflectivities of objects 
that come in the view and by the mean ambient light intensity. The statistics barely change 
over long time scales, while the mean intensity depends on, for example, clouds shading 
the sun and varies a lot and rapidly. The photoreceptor may want to adapt to intricate 
details of the distribution of reflectivities, but only after it accurately learns the mean light 
level. A similar separation of time scales is observed in transcriptional regulation, where, 
for example, changes in the lactose concentration happen on the scale of minutes, while 
statistics of lactose bursts depends on the environment and is constant for generations. In 
neuroscience, when estimating an angular velocity, a fly takes into account the preceding
velocity variance (Fairhall et al., 2001), but it may not have time to react to higher



6 The plots of $\mathcal{P}(\alpha)$ and $\rho(\epsilon;\bar\alpha)$ have very different meanings. The volume under the $\mathcal{P}(\alpha)$ surface is fixed
by normalization, $\int d\alpha\,\mathcal{P}(\alpha) = 1$. Thus high a priori probability on any singular line, e.g., $\alpha_1 = 0$, necessarily
means a lower prior elsewhere. Such considerations are the reason for no free lunch theorems (Wolpert, 1995).
In the language of the model density, the normalization condition is $\int d\epsilon\,\rho(\epsilon;\bar\alpha) = 1$. However, there are no
constraints on the density integrated over $\bar\alpha$, and a large density for some target does not necessarily result in
a lower density elsewhere.

Figure 1: Schematic density of models as a function of the target location.



order moments. Thus we believe that many natural learners that have a need to learn fast, 
but also to be able to learn a very wide class of models accurately on longer time scales, will 
be organized as nested learning machines with the density ridges approximately adjusted 
to fast variable directions. 



4.2 Fluctuation-dissipation and determining the model 

The prediction in the last Section brings us back to the main question of this work: how
can an underlying learning-theoretic computation be inferred? For many reasons, analysis
of learning curves is not always a good idea. First, learning may happen so fast that
resolving it might present a problem (Gallistel et al., 2001). Second, to estimate $\Lambda(N)$ reliably,
we need to average, and a complete instance of the learning curve is just one sample.
Such averaging may require prohibitively long experiments. Third, it is well known that animals
adapt. Thus eliciting the same response to the same target requires large inter-trial
time delays, further increasing the experimental duration. These problems can be traced
to learning being an inherently transient behavior, and they might become less severe if we
can characterize learning machines by some stationary response properties. A hint comes
from the Fluctuation-Dissipation Theorem in statistical physics (Ma, 1985), which states
that, if a system fluctuates in the presence of a linear dissipative restoring force, then the
variance of fluctuations (a stationary property) is linearly related to the dissipation coefficient
(a feature of the transient response). In our case, we may hope that response to a
variable target (fluctuations) reveals information about the learning curve (dissipation).
In view of this suggestion, let us now analyze a few examples of variable target learning.$^{7}$
We now denote by $\alpha$ an estimate of $\bar\alpha$ averaged over many presentations of the same
data. We keep almost all parameters fixed (or changing very slowly), while $\bar\alpha_1(t)$, which is
approximately the direction of the ridge in the density of solutions, is allowed to vary. If
data are observed for a long time, then $\alpha_\mu \to \bar\alpha_\mu$ for $\mu \neq 1$ (provided $\tilde\alpha = \bar\alpha$, that is,
the machine can represent the target exactly). Now remember
that $\Lambda$ is the expected Kullback-Leibler divergence between $\bar\alpha$ and $\alpha$, which converges
to the $\chi^2$ distance when it is small. Thus if $\alpha_1 - \bar\alpha_1$ is not large,

$$\Lambda \propto (\alpha_1 - \bar\alpha_1)^2. \qquad (36)$$

For a fixed target and $\Lambda \to 0$ (that is, for $N \to \infty$, $\tilde\alpha = \bar\alpha$), all learning curves we
studied can be summarized as

$$\frac{d\Lambda}{dN} = -C_\nu \Lambda^{\nu}. \qquad (37)$$

Here, in particular, $\nu = 1$ corresponds to a finite set of solutions along the direction of
$\bar\alpha_1$, $\nu = 2$ is the finite-parameter or nested case, and $\nu = 3$ is the $\eta = 1$ QFT model. In
principle, other values of $\nu \in (0; \infty)$ are possible. The constant $C_\nu \sim 1$ depends on the
details of the learning setup. For example, for parametric cases, $C_\nu = 2/K(r)$.
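
As a quick numerical check of the scaling implied by Eq. (37), the following minimal sketch integrates $d\Lambda/dN = -C_\nu \Lambda^\nu$ and measures the late-time log-log slope of $\Lambda(N)$, which for $\nu > 1$ should approach $-1/(\nu - 1)$. The function names, the initial value, and the step rule are illustrative assumptions and are not part of the original analysis.

```python
import math

def integrate_learning_curve(nu, c_nu=1.0, lam0=0.5, n_max=1.0e6):
    """RK4 integration of dLambda/dN = -c_nu * Lambda**nu (intended for nu > 1).

    The step is 1% of the current N, so the grid is roughly geometric in N.
    Returns a list of (N, Lambda) pairs. All defaults are illustrative.
    """
    def rhs(lam):
        return -c_nu * lam ** nu

    n, lam = 1.0, lam0
    history = [(n, lam)]
    while n < n_max:
        h = 0.01 * n
        k1 = rhs(lam)
        k2 = rhs(lam + 0.5 * h * k1)
        k3 = rhs(lam + 0.5 * h * k2)
        k4 = rhs(lam + h * k3)
        lam += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        n += h
        history.append((n, lam))
    return history

def late_time_exponent(history):
    """Log-log slope of Lambda(N) measured over the last decade of N."""
    n_end, lam_end = history[-1]
    n_start, lam_start = next((n, lam) for n, lam in history if n >= n_end / 10.0)
    return math.log(lam_end / lam_start) / math.log(n_end / n_start)

if __name__ == "__main__":
    for nu in (2.0, 3.0):
        slope = late_time_exponent(integrate_learning_curve(nu))
        print(f"nu = {nu}: slope = {slope:.3f}, expected {-1.0 / (nu - 1.0):.3f}")
```

With these settings the parametric case ($\nu = 2$) should give a slope close to $-1$ and the $\eta = 1$ QFT case ($\nu = 3$) a slope close to $-1/2$.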

For Eq. (37), which is manifestly true for a fixed $\bar\alpha$, to also hold in the fluctuating target
case, the learning machine must quickly notice the target's variation and disregard old
samples as soon as they become outdated. Gallistel et al. (2001) show that a rat reacts to
changes in the reward rates as fast as an ideal detector would. Therefore, this assumption is
reasonable for biological systems.$^{8}$

If measurements are taken at a fixed rate, so that $dN/dt = \mathrm{const}$, we can combine
Eqs. (36, 37) to get

$$\frac{d\Delta}{dt} = -\zeta\,\mathrm{sign}(\Delta)\,|\Delta|^{2\nu-1} - v_{\bar\alpha}, \qquad (38)$$

where $\Delta = \alpha_1 - \bar\alpha_1$ is the average error of the estimation, $\zeta$ is some unknown constant
with the dimensionality of $1/t$ and is basically the scaled sampling rate, and $v_{\bar\alpha}$ is the drift
velocity of the target. Equation (38) is a clear example of a dissipative system, and it has
many analogues in the theories of classical and quantum dissipation (Weiss, 1995). Note
also that, unlike in the fluctuation-dissipation analysis in statistical physics, the spectrum
of fluctuations, $v_{\bar\alpha}$, is not necessarily white and can be controlled by an experimentalist,
potentially providing more ways to probe the underlying dissipative dynamics.

If the target's variation cannot be learned (incomplete or mismatched machine), then
Eq. (38) still holds. However, because of Eq. (13), we now have $\Delta = \alpha_1 - \tilde\alpha_1$ (recall that $\tilde\alpha$
is the best approximation to the target by a particular learning machine). Thus to trace the
evolution of $\bar\alpha_1$ using Eq. (38), one would need to evaluate $\tilde\alpha(\bar\alpha)$, which can be done from
the stationary target analysis. Further, if the target varies along many learnable directions,
then for each such direction we have an analog of Eq. (38), possibly with a different $\zeta$. So the
dynamics of $\Lambda$ is still given by Eq. (37) with forcing, but the dissipation constant depends
on the number of varying parameters.

7 It is clear that stretching the theory of learning a fixed target to the fluctuating case may hide many potential
pitfalls. We do this because we are unaware of any comprehensive treatments of the latter problem
[though some progress is being made, cf. DeWeese and Zador (1998); Atwal and Bialek (2004)].

8 We leave aside important comments by DeWeese and Zador (1998), who argued that the time needed to notice
a change may not be invariant with respect to the direction of the change.

Let's now consider a few different examples of $v_{\bar\alpha}$. If $v_{\bar\alpha} = v_0$ is a constant, then asymptotically
for $t \to \infty$, setting $d\Delta/dt = 0$, we find

$$\Delta \to \Delta_\infty = -\left(\frac{v_0}{\zeta}\right)^{1/(2\nu-1)}. \qquad (39)$$

The ratio $v_0/\zeta$ must be $\ll 1$, otherwise $\Lambda$ is outside of the $\Lambda \to 0$ asymptotic, for which
Eq. (37) is valid. Thus, for small drifts, setups with smaller $\nu$ win qualitatively.
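
The constant-drift result can be checked numerically. The sketch below integrates the relaxation equation $d\Delta/dt = -\zeta\,\mathrm{sign}(\Delta)|\Delta|^{2\nu-1} - v_0$ with a forward-Euler scheme and compares the long-time value with $-(v_0/\zeta)^{1/(2\nu-1)}$. The function names, step size, and parameter values are assumptions chosen only for illustration, and $v_0 > 0$ is assumed.

```python
import math

def relax_with_constant_drift(nu, v0, zeta=1.0, dt=0.01, t_max=200.0, delta0=0.0):
    """Forward-Euler integration of
    dDelta/dt = -zeta * sign(Delta) * |Delta|**(2*nu - 1) - v0.
    Returns Delta at t_max. Defaults are illustrative."""
    delta = delta0
    for _ in range(int(t_max / dt)):
        restoring = -zeta * math.copysign(abs(delta) ** (2 * nu - 1), delta)
        delta += dt * (restoring - v0)
    return delta

def predicted_offset(nu, v0, zeta=1.0):
    """Stationary error for a constant drift v0 > 0: -(v0/zeta)**(1/(2*nu - 1))."""
    return -((v0 / zeta) ** (1.0 / (2 * nu - 1)))

if __name__ == "__main__":
    for nu in (1, 2, 3):
        print(nu, relax_with_constant_drift(nu, 0.01), predicted_offset(nu, 0.01))
```

For the same small ratio $v_0/\zeta$, the magnitude of the stationary error grows with $\nu$, which is the sense in which smaller $\nu$ tracks a slow drift better.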

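It is also of interest to consider the situation when $\bar\alpha_1$ undergoes a Brownian motion,
$\langle v_{\bar\alpha}(t)\, v_{\bar\alpha}(t') \rangle = Q\,\delta(t - t')$. Writing the Fokker-Planck equation for this Langevin dynamics,
we easily find the stationary distribution of $\Delta$,

$$P(\Delta) = \frac{\nu}{\Gamma(1/2\nu)} \left(\frac{\zeta}{\nu Q}\right)^{1/(2\nu)} \exp\left[-\frac{\zeta\,|\Delta|^{2\nu}}{\nu Q}\right], \qquad (40)$$

which results in the rms fluctuations of

$$\Delta_{\rm rms} = \left(\frac{\nu Q}{\zeta}\right)^{1/(2\nu)} \left[\frac{\Gamma(3/2\nu)}{\Gamma(1/2\nu)}\right]^{1/2}. \qquad (41)$$

Again, these results are true only if $\Delta_{\rm rms} \ll 1$, and again smaller $\nu$ provides for better
trailing of the target.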
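
The Brownian-target case can be illustrated with a direct simulation. The sketch below assumes the Langevin dynamics $d\Delta/dt = -\zeta\,\mathrm{sign}(\Delta)|\Delta|^{2\nu-1} - v(t)$ with white noise of strength $Q$, integrates it with an Euler-Maruyama step, and compares the measured rms error with the stationary value $(\nu Q/\zeta)^{1/(2\nu)}[\Gamma(3/2\nu)/\Gamma(1/2\nu)]^{1/2}$ that follows from the Boltzmann-like stationary distribution. Function names, the seed, and all numerical settings are illustrative assumptions.

```python
import math
import random

def predicted_rms(nu, q, zeta=1.0):
    """Stationary rms of Delta for white-noise drift of strength q."""
    scale = (nu * q / zeta) ** (1.0 / (2.0 * nu))
    ratio = math.gamma(3.0 / (2.0 * nu)) / math.gamma(1.0 / (2.0 * nu))
    return scale * math.sqrt(ratio)

def simulated_rms(nu, q, zeta=1.0, dt=0.02, n_steps=400_000, burn_in=50_000, seed=0):
    """Euler-Maruyama simulation of the Langevin equation for Delta.

    Each step adds a Gaussian kick of variance q*dt (the integrated white noise).
    Returns the rms of Delta after discarding the burn-in steps."""
    rng = random.Random(seed)
    kick = math.sqrt(q * dt)
    delta, total, count = 0.0, 0.0, 0
    for step in range(n_steps):
        restoring = -zeta * math.copysign(abs(delta) ** (2 * nu - 1), delta)
        delta += restoring * dt - kick * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            total += delta * delta
            count += 1
    return math.sqrt(total / count)

if __name__ == "__main__":
    for nu in (1, 2):
        print(nu, simulated_rms(nu, 0.02), predicted_rms(nu, 0.02))
```

For $\nu = 1$ the dynamics is an Ornstein-Uhlenbeck process and the prediction reduces to the familiar $\sqrt{Q/2\zeta}$; agreement for other $\nu$ is expected only up to sampling and discretization error.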

Finally, inspired by Fairhall et al. (2001), let's examine the case of a periodic motion of
$\bar\alpha_1$ and take, for simplicity, $\bar\alpha_1 = A \sin\omega t$, and $v_{\bar\alpha} = A\omega \cos\omega t$. Now Eq. (38) does not have
a simple solution. However, we search for an asymptotically periodic $\Delta(t)$ with the same
angular frequency of $\omega$. Therefore, if we multiply Eq. (38) by $\cos\omega t$, integrate over a full
period, and exchange the order of the differentiation and the integration, we get

$$\frac{d\langle \Delta \cos\omega t \rangle}{dt} = -\zeta \left\langle \mathrm{sign}(\Delta)\,|\Delta|^{2\nu-1} \cos\omega t \right\rangle - A\omega \left\langle \cos^2 \omega t \right\rangle, \qquad (42)$$

where $\langle \ldots \rangle$ denotes averaging over the period. Since we are looking for stationary oscillations,
time derivative applied to any average is zero. This gives

$$\left\langle \mathrm{sign}(\Delta)\,|\Delta|^{2\nu-1} \cos\omega t \right\rangle = -\frac{A\omega}{2\zeta}. \qquad (43)$$

Now multiplying Eq. (38) by $\mathrm{sign}(\Delta)\,|\Delta|^{2\nu-1}$ and averaging again results in

$$\left\langle |\Delta|^{4\nu-2} \right\rangle = \frac{A^2\omega^2}{2\zeta^2}, \qquad (44)$$

which is the same scaling as in Eq. (39). However, now we also have a dependence on $\omega$.
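
The periodic case can also be probed numerically. The sketch below integrates $d\Delta/dt = -\zeta\,\mathrm{sign}(\Delta)|\Delta|^{2\nu-1} - A\omega\cos\omega t$ with a midpoint rule and averages $|\Delta|^{4\nu-2}$ over the last period, for comparison with $A^2\omega^2/2\zeta^2$. For $\nu = 1$ the equation is linear and its periodic solution gives $\langle\Delta^2\rangle = A^2\omega^2/[2(\omega^2+\zeta^2)]$ exactly, so the comparison should be close only for slow modulation, $\omega \ll \zeta$. All names and numerical settings here are illustrative assumptions.

```python
import math

def periodic_moment(nu, amp, omega, zeta=1.0, n_periods=3, steps_per_period=20_000):
    """Average of |Delta|**(4*nu - 2) over the last driving period.

    Integrates dDelta/dt = -zeta*sign(Delta)*|Delta|**(2*nu-1) - amp*omega*cos(omega*t)
    with the explicit midpoint rule, starting from Delta = 0."""
    dt = 2.0 * math.pi / (omega * steps_per_period)

    def rhs(delta, t):
        restoring = -zeta * math.copysign(abs(delta) ** (2 * nu - 1), delta)
        return restoring - amp * omega * math.cos(omega * t)

    delta, acc = 0.0, 0.0
    first_recorded = (n_periods - 1) * steps_per_period
    for k in range(n_periods * steps_per_period):
        t = k * dt
        k1 = rhs(delta, t)
        delta += dt * rhs(delta + 0.5 * dt * k1, t + 0.5 * dt)
        if k >= first_recorded:
            acc += abs(delta) ** (4 * nu - 2)
    return acc / steps_per_period

def slow_drive_prediction(amp, omega, zeta=1.0):
    """Right-hand side A^2 omega^2 / (2 zeta^2) of the stationary-oscillation estimate."""
    return amp ** 2 * omega ** 2 / (2.0 * zeta ** 2)

if __name__ == "__main__":
    slow = periodic_moment(1, 0.1, 0.05) / slow_drive_prediction(0.1, 0.05)
    fast = periodic_moment(1, 0.1, 2.0, n_periods=20, steps_per_period=2_000) / slow_drive_prediction(0.1, 2.0)
    print(f"ratio at omega=0.05: {slow:.4f}, at omega=2: {fast:.4f}")
```

The second call illustrates how the estimate degrades once the modulation is no longer slow compared with the relaxation rate.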

There are other cases that can be analyzed, such as a step jump in the target, $\bar\alpha_1$, its
square wave modulation, or its diffusion in a potential (Ornstein-Uhlenbeck process). Interestingly,
the last two of these cases were used experimentally by Fairhall et al. (2001).
However, we leave the analysis for the future, when it will be answering some specific
question and won't be just a mathematical exercise. Even with the three examples already
discussed, it is clear that letting the target move maps the scaling of the learning curve
into a stationary property (e.g., variance of the estimation error), which might be easier to
analyze experimentally.

5 Discussion



We have shown that, with a moving target, transient learning curves are replaced by differ- 
ent scaling dependences of the estimation errors on the amplitude of the target's motion. 
This effect is stationary and may be easier to observe experimentally. However, since we 
do not have a comprehensive theory of variable target learning yet, a few precautions are 
in order when designing and analyzing experiments along these lines. (1) Target veloci- 
ties must be kept small, so that the asymptotic analysis presented in this work holds. (2) 
The analysis is only valid when the learner forgets past observations as soon as the target 
changes appreciably. Learning will be much slower if such outdated samples are kept. (3) 
When varying the stimulus, we have to be reasonably sure that the animal only tracks it, 
but does not predict it. White noise $v_{\bar\alpha}$ or a multiparameter representation of the target in
terms of the position, velocity, acceleration, etc., might be a solution. (4) Finally, we have 
to keep in mind that, in a behaving animal, learning a change in a signal and reacting to 
it may be separated by a long delay, and special care is needed to observe the former, but 
not the latter. This being said, it nevertheless is possible that all these and other disadvan- 
tages will be outweighed by the ability to determine the correct learning-theoretic model 
of the organism by varying the amplitude and the nature (say, stochastic or periodic) of 
the target's motion and studying typical responses as functions of these parameters. 

Consider, for example, the experiment described in Fig. 4 of Fairhall et al. (2001). There
the input signal (the standard deviation of the angular velocity, $\sigma(t)$) undergoes a random
motion with a finite variance $\Sigma^2$ and a finite correlation time $\tau$. The instantaneous neuron
firing rate $r(t)$ is the estimate of $\sigma(t)$. Repeating exactly the same randomly generated
stimulus many times and averaging over spike trains, one may estimate $r(t)$ and, consequently,
the rms estimation error $\Delta_{\rm rms} = \langle (\sigma(t) - r(t))^2 \rangle^{1/2}$. Studying the dependence of $\Delta_{\rm rms}$
on $\Sigma$ and $\tau$ along the lines of Eq. (41), one can estimate $\nu$. Any $\nu \neq 2$ uniquely determines
the underlying computational model. For $\nu = 2$, to distinguish a usual finite parameter
model from the one that is nested, one makes the signal multidimensional (other parameters
of the angular velocity, such as the mean and the skewness, vary together with $\sigma$).
For at least some signal extensions, the nested model will change the magnitude (but not
the scaling) of $\Delta_{\rm rms}$ since $\zeta \propto 1/K(r)$. In contrast, the simpler model will keep the same
prefactor but will be converging only to an approximation of the target.
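
To make the proposed estimate of $\nu$ concrete, here is a minimal sketch of the fitting step. It assumes the Brownian-target scaling $\Delta_{\rm rms} \propto Q^{1/(2\nu)}$ at fixed $\zeta$, so that $\nu$ follows from the slope of $\log \Delta_{\rm rms}$ versus $\log Q$. How the stimulus parameters $\Sigma$ and $\tau$ of a real experiment map onto an effective $Q$ is not specified here, and the function name and the synthetic data are hypothetical.

```python
import math

def fit_nu(noise_strengths, rms_errors):
    """Estimate nu from the least-squares slope of log(rms error) vs log(Q).

    Assumes rms error ~ Q**(1/(2*nu)), so nu = 1 / (2 * slope)."""
    xs = [math.log(q) for q in noise_strengths]
    ys = [math.log(d) for d in rms_errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return 1.0 / (2.0 * sxy / sxx)

if __name__ == "__main__":
    # Synthetic, noise-free data generated with nu = 2 (prefactor 0.7 is arbitrary).
    qs = [1e-4, 1e-3, 1e-2]
    errors = [0.7 * q ** 0.25 for q in qs]
    print(fit_nu(qs, errors))
```

With real spike-train data the fit would also need error bars, since neighboring values of $\nu$ differ only in the slope of a log-log plot.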

In the cognitive experiments of Gallistel et al. (2001), a rat was trying to learn reward rates
on different terminals and match its foraging habits correspondingly. It was determined 
to be an ideal change detector. Now to build a more detailed model of the animal, one 
can vary the reward rates continuously, repeat experiments many times, and then look at 
the average mismatch between the stimulus and the response. Then dependence of the 
mismatch on the parameters of the rate changes will point at a proper class of learning-theoretic
models to compare the rat to. Similarly, one can do this type of analysis on artificial
neural networks designed explicitly to model particular animal behavior (Seung,
2003); this will build connections between network architectures and types of inference
tasks performed by them. 

Another conclusion of our work is that the nested setup may learn faster than the QFT 
one under some conditions. Thus if one desires a complete learning machine, a nested 
machine should be built unless there is some specific reason to do the opposite (such as
knowing that the world is unlikely to have sharp cutoffs). With experiments along the
lines suggested above, this prediction should be testable. We should be able to see if our
intuitive beliefs about appropriate complexities of learners for particular tasks match
Nature's choices. It would also be interesting to study if structural characteristics of a
learner are correlated with its learning-theoretic description. That is, could it be that mod- 
ular, irregular networks, like those seen in biochemistry, often compute like parametric or 
nested machines? And could layered, regular networks in our brains, which are believed 
to be able to solve the most complicated learning problems, be realizing QFT machines 
instead? 



A Model family selection in the nested setup 



Inference in the Bayesian i.i.d. setup is quite standard (Press, 1989; Bialek et al., 1996; Balasubramanian,
1997; Bernardo, 2003; Raftery and Zheng, 2003), and the nested case is not very
different. For example, a posteriori expectations of parameter values are given by a derivative
of the posterior moment generating function (or the partition function), $Z(\mathbf{J})$:



Z(3) 
Z r {3)r 



d 



dJ„ 



log^(J), 



j=o 



^P(r)Z r (J)r, 

d K{r) a e-^M+SS -W 



N 



N 



i=i 

K(r) 

fl=l 8=1 

■ log Q(x\a) . 



C r (a) 



lx a 



(45) 

(46) 
(47) 

(48) 

(49) 

(50) 
(51) 



The posterior expectations are thus determined by the properties of $Z(\mathbf{J})$, which can
be calculated using the saddle point analysis for $N \gg 1$. This is difficult for the first form
of $Z(\mathbf{J})$, Eq. (46), due to the singularity at $\alpha_\mu = 0$ [the singularity was also the reason why
we left $\mathcal{P}(\alpha)$ out of the combined Lagrangian, Eq. (^9|)]. Hence we return to the nested
form, Eqs. (47, 48), but the equivalence between the representations should be kept in mind.
Exchanging the order of integration and summation in Eqs. (46, 47) and similar is possible
if the priors decay sufficiently fast at $r \to \infty$, or are regularized with the regularization lifted
after averages are calculated. Unless mentioned otherwise, this is always assumed.

The expectation of $\alpha_\mu$ in the model families with $K(r) < \mu$ is necessarily zero, and a
similar bias towards smaller magnitudes of parameters will be present when we average
over families. Therefore, the a priori decrease of the variances with $\mu$, Eqs. ( ^P) , 23), will
persist a posteriori for finite $N$. This is the famous James and Stein (1961) shrinkage.

The saddle point, also called classical or maximum likelihood, values of parameters in
each family, $\alpha^* = \{\alpha^*_\mu\}$, and the second derivatives matrix at the saddle, $F_r$, are determined
by (remember that $\alpha^*_\mu = 0$ for $\mu > K(r)$)$^{9}$

$$\left.\frac{\partial \mathcal{L}_r(\alpha)}{\partial \alpha_\mu}\right|_{\alpha^*} = 0, \quad \mu \le K(r), \qquad (52)$$

$$\left.\frac{\partial^2 \mathcal{L}_r(\alpha)}{\partial \alpha_\mu\, \partial \alpha_\nu}\right|_{\alpha^*} = (F_r)_{\mu\nu}, \quad \mu, \nu \le K(r). \qquad (53)$$

To the first order in $1/N$, this gives


Z W = E^(0 ™^^ Q(MI^)el^ lj ^-< , (54) 



N 

where the $\mu \le K(r)$ components of $\mathbf{J}_r$ are the same as those of $\mathbf{J}$, and all higher order components
are zero. Differentiating, we get:

$$\langle \alpha_\mu \rangle = \frac{\sum_{r=1}^{R} \alpha^*_\mu\, e^{-\mathcal{L}(r)}}{\sum_{r=1}^{R} e^{-\mathcal{L}(r)}}, \qquad (55)$$

$$\mathcal{L}(r) = -\log \mathcal{P}(r) - \sum_{\mu=1}^{K(r)} \log \mathcal{P}(\alpha^*_\mu | r) - \sum_{i=1}^{N} \log Q(x_i|\alpha^*) + \frac{K(r)}{2} \log \frac{N}{2\pi} + \frac{1}{2}\,\mathrm{Tr} \log \frac{F_r}{N}. \qquad (56)$$

For finite $R$, and $\beta = 0$, this is the usual Bayesian model family selection: a posteriori expectations
are sums weighted by the posterior probabilities of the families, which are defined by $e^{-\mathcal{L}(r)}$. This
posterior includes the negative maximum likelihood term, $-\sum_{i=1}^{N} \log Q(x_i|\alpha^*)$, which grows in
magnitude linearly with $N$, but decreases as $r$ grows due to nestedness. It also incorporates
the fluctuation determinant, $\frac{K(r)}{2} \log \frac{N}{2\pi} + \frac{1}{2}\,\mathrm{Tr} \log \frac{F_r}{N}$, which grows logarithmically in
$N$, but increases with $r$. Depending on the value of $N$, there will be some $r^*$, for which
$\mathcal{L}(r)$ is minimal. For large $N$, as a discrete analog of the saddle point argument, this value
will dominate the sums in Eq. (55), hence some model family will be "selected."
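
The selection mechanism can be sketched in a few lines. The code below is a simplified illustration that keeps only the leading pieces of the family cost: the negative maximum log-likelihood, the $\frac{K(r)}{2}\log\frac{N}{2\pi}$ part of the fluctuation determinant, and an optional log-prior over families. The $\mathrm{Tr}\log(F_r/N)$ term and the parameter priors are dropped as an assumption of the sketch, and the function name and the toy numbers are hypothetical.

```python
import math

def family_posterior(neg_log_likelihoods, n_params, n_samples, log_priors=None):
    """Approximate posterior weights of nested model families.

    cost(r) = -max log-likelihood + (K(r)/2) * log(N / 2 pi) - log prior(r);
    weights are proportional to exp(-cost). Returns (costs, weights)."""
    if log_priors is None:
        log_priors = [0.0] * len(n_params)  # flat prior over families (assumption)
    penalty = math.log(n_samples / (2.0 * math.pi))
    costs = [nll + 0.5 * k * penalty - lp
             for nll, k, lp in zip(neg_log_likelihoods, n_params, log_priors)]
    c_min = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - c_min)) for c in costs]
    total = sum(weights)
    return costs, [w / total for w in weights]

if __name__ == "__main__":
    # Toy numbers: the fit improves quickly up to K = 2 and only marginally after that.
    costs, weights = family_posterior([120.0, 100.0, 99.5, 99.4], [1, 2, 3, 4], 100)
    print(costs)
    print(weights)
```

In this toy example the likelihood gain from the third and fourth parameters is smaller than the logarithmic penalty, so most of the posterior weight stays on the two-parameter family.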

However, Eqs. (55, 56) become more interesting if one lets $R \to \infty$. The completeness
condition ensures that for large enough $r$ one will be overfitting the data, and $Q(x|\alpha^*) \to
\frac{1}{N} \sum_i \delta(x - x_i)$. Therefore, if the sums are dominated by $r \to \infty$, then consistency breaks
and the learning fails. One would thus expect two features to influence the success of the
learning. First, it is the prior $\mathcal{P}(r)$, which switches on extra degrees of freedom: for slowly
decaying priors one would expect $r \to \infty$ terms to win. Second, it is the dependence of
the likelihood term on $r$, which measures how capable the newly activated degrees of
freedom are of overfitting, or, equivalently, how fast $\max_i Q(x_i|\alpha^*)$ grows.

From Eq. (56) it is easy to see that large $r$ will have an exponentially small weight in
the posterior probability if

$$\lim_{r \to \infty} \frac{N \max_i \log Q(x_i|\alpha^*) + \log \mathcal{P}(r)}{K(r) \log N} = 0. \qquad (57)$$



9 There are possibilities of more than one saddle point and of other anomalies. This was analyzed by Bialek
et al. (2001). The conditions to prevent such problems are mild, and we assume them to hold in what follows.

Under this condition, $Q(x|\alpha^*)$ will eventually approach the correct distribution, but not
the sum of $\delta$-functions. Colloquially, Eq. (57) requires the explanatory capacity of the new,
high order degrees of freedom to be small enough so that keeping them always "on" does
not make sense. This criterion, which we have not seen explicitly presented anywhere before,
is similar to the consistency condition of the Structural Risk Minimization (SRM) theory,
which requires that the Vapnik-Chervonenkis dimension, the SRM capacity measure
of the selected model, grows slower than the number of samples to be explained (Vapnik,
1998; Nemenman, 2000).



As an example, let's analyze how the condition in Eq. (57) may be violated for $K(r) \sim r$.
In this case $\mathcal{P}(r)$ must be superexponential to be relevant for finding $r^*$. Thus it is not
required to decay at some minimal speed as might have been expected, though a need to
exchange the order of integrations and summations in arriving at Eq. (56) may still force
that. Due to light tails and small effective support, exponentially decaying priors are not
very interesting, so we disregard the prior term in Eq. (57). Then, for a fixed large $N$, a
finite $r^*$ will be dominant if $\log Q(x_i|\alpha^*)/r \to 0$. That is, the growth of the $\delta$-function-like
peaks of the maximum likelihood distribution should be superlinear in $K$, the number of
parameters in the model family, in order for $r^* \to \infty$ and the Bayesian setup to be inconsistent.



B Fourier polynomials nested model 

To compare nonparametric and finite parameter nested scenarios directly, we analyze the
following example. Consider families of probability distributions periodic on $[0, 1)$, and
with the logarithms of the distributions given by Fourier polynomials of degree $r < \infty$,
as in Eq. (33). Due to the normalization condition, Eq. (34), the number of parameters in
the $r$'th model family is $K(r) = 2r$. With an appropriate choice of priors, Eqs. (14, 15),
these families form a nested set, and the completeness for $R \to \infty$ follows from the Fourier
theorem.

The classical solution for this parameterization is (1 < fj, < r) 

% +£ ( Z ) 2 " m - N I d -Q(-\"'i ( c s " ) 2 "f- - 

Here are the cosine (sine) amplitudes of the /x'th mode in the Fourier expansion of 
Q{x\a), and are the same for the empirical probability density, l/N ^ 5(x — X{). 
are also the stochastic Fourier transform of Q{x). The cosine-cosine components of the 
second derivative matrix at the saddle point are 



d 2 c 



da^daj 



-^rr + N [ dxQ(x\a*) cos 2itfix cos 2-kvx 
®fi J 

-N j dx Q(x\a*) cos27r/ux J dyQ{y\a*) cos2T\vy (59) 



+ T (Q*% + + yQ^Q*/ , (60)

and the sine-sine and the sine-cosine components are written similarly. This matrix is
provably positive definite. Thus for $N \to \infty$ we can perform the saddle point analysis.

For $\beta = 0$, the variance $\sigma^2_\mu$ is constant, and we can neglect the first term in Eq. (58) in
the limit of large $N$. This leads to the following solution of the saddle point equations:

Q^»A± M = l...r. (61) 

For $\beta > 0$, $Q^*_\mu$ will be corrected by a systematic $\beta$-dependent bias, which will tend to 0
for fixed $\mu$ as $N$ grows. This will decrease the posterior variance of the estimator $Q^*$.

Equation (61) says that the first $r$ pairs of coefficients $\alpha^*_\mu$ are such that the corresponding
Fourier amplitudes of the classical solution $Q^*$ match those of the empirical one. By the
Nyquist theorem and the law of large numbers, for $r < N/2$, these amplitudes approach the Fourier
amplitudes of the unknown target probability density $Q$. Thus the low frequency modes
will be learned well. However, if $r > N/2$ the saddle point solution will start to overfit and
develop $\delta$-like spikes at each observed data point. This is in accord with the observation
we have already mentioned: to guarantee consistency, the capacity of models, as measured
by either the VC dimension or the scaling dimension of Bialek et al. (2001), which in this
case is equal to the number of free parameters, must grow slower than $N$.

To avoid overfitting when averaging over $r$, we must make sure that the contribution of
$r \to \infty$ to the posterior log-probability, Eq. (56), is negligible. In this regime, according to
Eq. (61), the $r$ available modes will create peaks of height $\sim r$ (recall the Fourier expansion
of the $\delta$-function) at the observed sample points. With $K(r) = 2r$, this ensures consistency
by satisfying Eq. (57).



Further, we can prove that $r^*$ is not only finite, but actually grows sublinearly in $N$,
again paralleling results for SRM (Vapnik, 1998) and their Bayesian equivalent (Nemenman,
2000). Suppose $r \gg N$ dominates the posterior. Then, for a slowly decaying $\mathcal{P}(r)$,
Eq. (56) can be rewritten as

$$\mathcal{L}(r) \sim -N \log r + r \log N. \qquad (62)$$

This is minimized (and the posterior probability is maximized) for $r^* \sim N/\log N$, and
higher values of $r$ are exponentially inhibited. Thus the assumption of $r \gg N$ being dominant
is incorrect, and the posterior probability is dominated by $r^* < N$ for all reasonable
priors. This is, of course, the worst case estimation, and in many typical applications the
value of $r^*$ is even lower.
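
The location of this minimum is easy to confirm by brute force. The sketch below evaluates the worst-case cost $-N \log r + r \log N$ on integer $r$ and compares the minimizer with $N/\log N$. The function names and the scan range are illustrative assumptions.

```python
import math

def worst_case_cost(r, n_samples):
    """Worst-case family cost: -N log r + r log N."""
    return -n_samples * math.log(r) + r * math.log(n_samples)

def most_probable_r(n_samples, r_max=None):
    """Integer r that minimizes the worst-case cost, found by direct scan."""
    if r_max is None:
        r_max = 5 * n_samples  # scan well past N (arbitrary choice)
    return min(range(1, r_max + 1), key=lambda r: worst_case_cost(r, n_samples))

if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, most_probable_r(n), n / math.log(n))
```

Because the cost is convex in $r$, the integer minimizer lies within one unit of the continuous solution $N/\log N$, well below $N$.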



C Fourier nested model and QFT targets 

As shown above, $r^*$ that minimizes Eq. (56) for a Fourier nested setup is much smaller than
$N$. This is true for any target, including QFT-typical targets. For $r$ of such magnitude, the
first $r$ modes of the target are well approximated by the estimate, and they contribute
$O(r/N)$ to the leading data dependent term in Eq. (56). The modes of the target above
the $r$'th are not fitted by the estimate, and each of them contributes its variance of about
$\ell^{-2\eta+1} \mu^{-2\eta}$ to the data dependent term, adding up to $\sum_{\mu=r+1}^{\infty} \ell^{-2\eta+1} \mu^{-2\eta} \propto (r\ell)^{-2\eta+1}$.
Combined with the fluctuation determinant this gives

$$\mathcal{L}(r) \sim N (r\ell)^{-2\eta+1} + r \log N \qquad (63)$$

for determining the most probable $r$. Thus for $N \gg 1$ (or $\Lambda \ll 1$)

$$r^* \propto \left(\frac{N}{\log N}\right)^{1/2\eta} \ell^{1/2\eta - 1}, \quad \text{and} \qquad (64)$$



V oc ( l ^L) l ~ 1/2 \ (65 ) 



Due to the many simplifications made here, the exact form of the logarithmic terms in these
expressions is questionable,$^{10}$ and, in practice, they are impossible to observe for realistic $N$
due to the target-dependent prefactors in front of the universal scaling term and various
statistical fluctuations. However, the power law in Eq. (66), which is definitely correct,
suggests that the performance of the nested model is comparable to that of the true QFT 
one. In particular, the nested learning machine also can solve arbitrarily complex inference 
problems. 

A rigorous way to estimate performance of the nested learning on a nonparametric 
target is to calculate (p(e)) = J daV(a:)p(e;a), where p is of the form Eq. (24), and the 
averaging is done over the QFT prior, and then calculate V and A from this averaged 
p. This is difficult, and instead we may choose to replace (p(e)) by ^(ejatyp), where 
Qi t yp is a typical target in the nonparametric prior, Eq. (f^). 11 For such a typ , D r (a) ~ 
E M >r ^ 2v+ 1 ^ 2v « (tr)- 2 ** 1 . Further, V(a± typ \r) ~ exp[-0.5 p- 2r >£-^ +1 /a 2 ,]. In our 
case, a 2 ~ p~ 2f3 . Therefore, for 2(rj — 0) > 1, which is satisfied for r] > 1/2 and (3 = 0, 



this gives V(a r \r) ~ exp 
and C2 are constants. For 



" £jl=l vr 2r >t- 2 ^ x lal ~ exp [d - C 2 r- 2 ^-P)+ 1 ], where d 
arge enough r, this whole expression tends to a constant. Thus, 
combining with Eq. (^), we get 

^_ 27r r \e - (£r)~ 2r i+i-](r-i) 

If $\mathcal{P}(r)$ is subexponential as before, we get $\rho$ to the leading order in small $\epsilon$ by calculating
the sum in Eq. (67) using the saddle point analysis and taking just the zeroth order term.



10 If a certain derivative of the target distribution satisfies some Lipschitz conditions, then the Occam factor
and the learning curve for histogramming density estimators provably have logarithmic contributions (Hall
and Hannan, 1988; Rissanen et al., 1992). In contrast, logarithmic corrections for QFT models and for parametric
learning of QFT-typical targets have not yet been analyzed. However, the logarithmic differences between
the cases have been expected: in the discrete case, $\beta = 0$, once we know $K^* \sim N^{\nu}$, each of the $K^*$ parameters is free
to vary with the same variance, giving the familiar $N^{\nu} \log N$ fluctuations. For the nonparametric case, $\sigma_\mu < \sigma_\nu$
for $\mu > \nu$. Thus each next parameter varies less, somewhat decreasing the total fluctuations (Bialek et al.,
2001).

These logarithmic terms have the same roots as the difference between cross-validation, bootstrap, and
Akaike's model selection criterion on one hand and Dawid's prequential statistics and Bayesian model selection
on the other (Stone, 197^; Dawid, 1984). There the difference in the magnitude of the prediction error is
also due to most of the parameters that are active at a given $N$ being latent for smaller sample sizes.

n The benefit (p(e)) provides over p(e; a tyP ) is knowing the prefactors in V and A. We don't believe that 
any of the priors studied in this work will be exactly realized in nature. Therefore, calculation of (p(ej) is not a 
priority.

The saddle value for $r$ is $r^* \sim \epsilon^{-1/(2\eta-1)} \ell^{-1}$, which gives

p(e; a ty pi ca i) ~ e , (68) 

with the first subleading term of $O\left(\exp\left[-\epsilon^{-1/(2\eta-1)} \ell^{-1}\right]\right)$. Doing the leading order evaluation
of the integral in Eq. (||), we now get $\epsilon^* \sim (N\ell/\log N)^{1/2\eta - 1}$, which again results in
Eq. (65).

In summary, learning a distribution typical in the nonparametric model by means of
the nested setup results in, at most, a logarithmic performance loss.